Keywords: digitalization, foreign trade activity, international trade, Uzbekistan regions

1. The objective of this research

The goal of this study is to examine the impact of digitalization on the regions of the Republic of Uzbekistan, as well as the consequences for their foreign trade activity.

2. Literature review

Li, Zhang, and Zhang (2023) investigates the influence of the digital economy on inter-regional trade within China in their article1. The authors employ a panel data analysis of China’s provinces from 2006 to 2017. Their key arguments are:

The study acknowledges that the digital economy’s impact on resource allocation and technological innovation in promoting inter-regional trade requires further exploration. Although this article analyses the impact of digitalization on inter-regional trade, methodology and findings is similar to this case.

At the practical application level, the digital economy is based on digital technology, and the use of information and communication technology, digital platforms, and blockchain technology may drastically lower trade costs2. Some researchers have also investigated the possibility of the digital economy to improve bilateral trade and greatly increase domestic trade flows by developing a comprehensive digital economy index3. Given the great degree of overlap between the Internet and digitalization, researchers have investigated the Internet’s influence on domestic and international trade. Furthermore, several studies have discovered that the digital economy drives greater growth in local commerce than foreign trade, and the potential for improving inter-regional connection remains enormous.

3. Theory

Compared to statistical approaches using cross-sectional data, panel data analysis allows researchers to achieve a substantially better level of statistical validity in data analysis and evaluation of models using more complex research designs. Panel data analysis has been utilized in many social science research papers and journal articles as a result of these benefits, and in recent years, academics have begun to employ it more and more.

At its core, panel data allows us to control for unobserved heterogeneity – persistent characteristics of individuals that aren’t directly captured by the data but can influence the dependent variable. For example, in this case, regions have their own unique characterstics that have impact on foreign trade turnover, such as geographical location, culture. Panel data helps address this by including an individual fixed effect (α_i) in the model. This effect captures all the time-invariant characteristics of individual ‘i’ that influence income.

As noted by Baltagi (2005)4, one of the fixed effects model’s main advantages is its capacity to account for bias resulting from missing variables. It is possible to effectively handle unobserved heterogeneity, which might potentially obscure the connection between variables in a simple regression.

The mathematical representation of the fixed effects model:

\[Y_it = α_i + \beta X_it + ε_it\]

But there are also drawbacks to the model. Wooldridge (2010)5 asserts that because time-invariant independent factors are differenced out along with the individual fixed effect, fixed effects models are unable to evaluate the impact of these variables.

Despite these drawbacks, fixed effects models have numerous usages in many different industries. When examining the effect of work training on earnings, labor economists utilize them to account for individual abilities. In a similar way, political scientists may use them to examine how various policies affect voting patterns while taking into consideration unreported regional variables.

There is more to panel data theory than just fixed effects. The random effects model, which presupposes that each individual impact is random and uncorrelated with the independent variables, was developed by researchers such as Arellano and Bond (1991)6. This makes it possible to estimate the impacts of independent variables that change over time as well as those that don’t.

4. DataSet Details

The dataset is constructed by the author based on the statistical data from Stat.uz (STATISTICS AGENCY UNDER THE PRESIDENT OF THE REPUBLIC OF UZBEKISTAN). Panel dataset consists of cross-sectional data, including economic data of Uzbekistan’s regions from 2015 to 2022.

There is a total of 112 observations on 14 regions and 12 variables.

5. Data structure

This panel dataset is balanced and it has these variables:

All variables are quantitative.

6. Descriptive statistics

library(corrplot)
library(readxl)
mydata <- read_excel("./DATA/data.xlsx", 
    sheet = "Main")
head(mydata)
## # A tibble: 6 × 18
##   Region      Year Export Import   GCI    IS   APC   NCI    IA    AW    TT    TB
##   <chr>      <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 The Repub…  2015   95.4   191.  116   501. 32266  9372  22.2 1133.  286. -95.3
## 2 The Repub…  2016  428.    200.  106.  526. 36322 10692  23.9 1276.  628. 228. 
## 3 The Repub…  2017  462.    179.  116.  607. 41125 12594  25   1553.  641. 284. 
## 4 The Repub…  2018  458.    284.  105.  768. 45483 15431  24.9 1949.  742  174. 
## 5 The Repub…  2019  431.    269.  112.  922. 50945 17788  24.7 2322.  700  162. 
## 6 The Repub…  2020  363.    176.  108. 1072. 53251 21882  19.7 2437   539  186. 
## # ℹ 6 more variables: TNRS <dbl>, NRS <dbl>, NRSI <dbl>, SHPC <dbl>,
## #   SHLN <dbl>, NCLN <dbl>
str(mydata)
## tibble [112 × 18] (S3: tbl_df/tbl/data.frame)
##  $ Region: chr [1:112] "The Republic of Karakalpakstan" "The Republic of Karakalpakstan" "The Republic of Karakalpakstan" "The Republic of Karakalpakstan" ...
##  $ Year  : num [1:112] 2015 2016 2017 2018 2019 ...
##  $ Export: num [1:112] 95.4 428.3 462.4 457.8 431.1 ...
##  $ Import: num [1:112] 191 200 179 284 269 ...
##  $ GCI   : num [1:112] 116 106 116 105 112 ...
##  $ IS    : num [1:112] 501 526 607 768 922 ...
##  $ APC   : num [1:112] 32266 36322 41125 45483 50945 ...
##  $ NCI   : num [1:112] 9372 10692 12594 15431 17788 ...
##  $ IA    : num [1:112] 22.2 23.9 25 24.9 24.7 ...
##  $ AW    : num [1:112] 1133 1276 1553 1949 2322 ...
##  $ TT    : num [1:112] 286 628 641 742 700 ...
##  $ TB    : num [1:112] -95.3 228.2 283.7 173.6 162.2 ...
##  $ TNRS  : num [1:112] 1203 1202 1296 1263 1332 ...
##  $ NRS   : num [1:112] 67.7 66.6 70.8 68 70.7 72.1 73.7 76.4 60.6 61.5 ...
##  $ NRSI  : num [1:112] 66.2 65.4 69.6 66.6 68.6 69.9 70.2 71.9 59.7 60.7 ...
##  $ SHPC  : num [1:112] 40.6 50.9 53.9 59.9 64.7 72.2 54.2 36.2 34.6 37.7 ...
##  $ SHLN  : num [1:112] 4 4.5 4.9 4.6 4.5 2.9 2.5 5.7 1.8 1.9 ...
##  $ NCLN  : num [1:112] 10864 12260 14379 16216 18852 ...

Variables


Numerical (quantitative, discrete): Export, TB
Numerical (quantitative, continuous): GCI, IS, APC,NCI, TT, AW, IA

Categorical (qualitative, nominal): -
Categorical (qualitative, ordinal): Region, Year

summary(mydata)
##     Region               Year          Export           Import       
##  Length:112         Min.   :2015   Min.   :  76.2   Min.   :   50.9  
##  Class :character   1st Qu.:2017   1st Qu.: 198.9   1st Qu.:  285.6  
##  Mode  :character   Median :2018   Median : 313.9   Median :  532.7  
##                     Mean   :2018   Mean   : 617.8   Mean   : 1350.7  
##                     3rd Qu.:2020   3rd Qu.: 509.9   3rd Qu.: 1247.5  
##                     Max.   :2022   Max.   :4669.3   Max.   :14939.7  
##       GCI              IS              APC              NCI        
##  Min.   :104.3   Min.   : 212.4   Min.   : 17331   Min.   :  3961  
##  1st Qu.:109.9   1st Qu.: 542.1   1st Qu.: 36562   1st Qu.: 10798  
##  Median :114.2   Median : 874.2   Median : 50846   Median : 16408  
##  Mean   :116.1   Mean   :1147.6   Mean   : 68217   Mean   : 28802  
##  3rd Qu.:121.1   3rd Qu.:1380.4   3rd Qu.: 62679   3rd Qu.: 26039  
##  Max.   :143.1   Max.   :5385.0   Max.   :381051   Max.   :236242  
##        IA              AW             TT                TB           
##  Min.   : 7.60   Min.   : 943   Min.   :  158.3   Min.   :-10270.40  
##  1st Qu.:17.20   1st Qu.:1341   1st Qu.:  554.6   1st Qu.:  -743.48  
##  Median :20.05   Median :2107   Median :  852.2   Median :  -243.15  
##  Mean   :21.41   Mean   :2370   Mean   : 1968.5   Mean   :  -732.95  
##  3rd Qu.:24.93   3rd Qu.:2942   3rd Qu.: 1624.4   3rd Qu.:   -52.15  
##  Max.   :44.50   Max.   :9149   Max.   :19609.0   Max.   :   283.70  
##       TNRS             NRS              NRSI             SHPC      
##  Min.   : 523.6   Min.   : 20.20   Min.   : 19.10   Min.   :15.00  
##  1st Qu.:1055.3   1st Qu.: 59.15   1st Qu.: 57.60   1st Qu.:38.02  
##  Median :1470.8   Median : 66.05   Median : 64.45   Median :47.15  
##  Mean   :1760.8   Mean   : 74.68   Mean   : 70.12   Mean   :47.73  
##  3rd Qu.:2075.1   3rd Qu.: 74.15   3rd Qu.: 70.12   3rd Qu.:57.40  
##  Max.   :7691.9   Max.   :264.40   Max.   :197.20   Max.   :79.70  
##       SHLN             NCLN       
##  Min.   : 1.300   Min.   :  5777  
##  1st Qu.: 3.175   1st Qu.: 13314  
##  Median : 4.500   Median : 17414  
##  Mean   : 4.507   Mean   : 28234  
##  3rd Qu.: 5.525   3rd Qu.: 22024  
##  Max.   :11.300   Max.   :233004

Summary:

  • Year - Dataset provides information from 2015 to 2022

  • Export - The minimum export value of 14 regions is 75.2 mln US dollars. And on average, regions exported products worth of 617.8 million. But more than more than 50% of regions indicated 313.9 million US dollar export, which means there are potential outliers in the dataset

  • Import - Mean value of import shows that regions imported more products than they export on average (1350.7 million US dollars).

  • Growth rates in communication and information services (GCI) vary between 104.3% and 143.1%, with a mean growth rate of 116.1%.

  • IS - The number of internet subscribers shows substantial variation, with a minimum of 212.4 thousand and a maximum of 5385.0 thousand. The mean number of internet subscribers is 1147.6 thousand.

  • APC - The minimum APC is 17331 units, while the maximum is 381051 units.

  • NCI - The range for NCI is from a minimum of 3961 units to a maximum of 236242 units.

  • TT - ranges from 158.3 to 19609.0 million US dollars, with a mean of 1968.5 million US dollars.

  • TB - ranges from -10270.40 to 283.70 million US dollars, with a mean trade deficit of -732.95 million US dollars. A negative trade balance suggests a higher value of imports than exports on average in these regions

# Boxplot for Export and Import
boxplot(mydata$Export, mydata$Import, names = c("Export", "Import"), col = c("skyblue", "lightcoral"), main = "Boxplot of Export and Import")

Boxplot shows that there are many outlier regions in both export and import idicators and median tends to lower side of the candle.

boxplot(mydata$AW, mydata$IS, names = c("Average monthly wages", "Number of subscribers with Internet access"), col = c("skyblue", "lightcoral"), main = "Boxplot of AW and IS")

7. Correlation analysis

Correlation between two variables reveals if they are significantly correlated and provides information about the strength and direction of the linear relationship between them.

The correlation coefficient (r) can range from -1 to 1.

If r = 1, it means there is a complete positive correlation between variables.
If r equals 0, there is no linear connection between the two variables. If r = -1, then there is a complete negative correlation between the variables.

corr <- cor(mydata[c(-1, -2)])
corr
##            Export     Import         GCI         IS        APC        NCI
## Export  1.0000000  0.9464173  0.43072055  0.7820418  0.9137079  0.9126492
## Import  0.9464173  1.0000000  0.35501911  0.8204491  0.9395345  0.9528416
## GCI     0.4307205  0.3550191  1.00000000  0.3961613  0.3286583  0.3782498
## IS      0.7820418  0.8204491  0.39616131  1.0000000  0.8789623  0.8878817
## APC     0.9137079  0.9395345  0.32865826  0.8789623  1.0000000  0.9862589
## NCI     0.9126492  0.9528416  0.37824976  0.8878817  0.9862589  1.0000000
## IA      0.4429925  0.4170457 -0.10351610  0.2056495  0.4978939  0.4693396
## AW      0.6200894  0.6694345  0.53166658  0.7973131  0.6369413  0.7226999
## TT      0.9703829  0.9964019  0.37860103  0.8189549  0.9425880  0.9522642
## TB     -0.8790801 -0.9859167 -0.30096647 -0.8060181 -0.9136071 -0.9337969
## TNRS    0.7873466  0.8432048  0.29553383  0.9377916  0.9126194  0.8931389
## NRS     0.7746907  0.8427551  0.25744768  0.7914671  0.8946952  0.9159824
## NRSI    0.6884985  0.7294981  0.18039578  0.7160722  0.8410490  0.8427889
## SHPC    0.1064967  0.1351762 -0.36378357  0.1473343  0.2642021  0.1826870
## SHLN    0.3861876  0.4206250  0.02493716  0.2659371  0.4742133  0.4843789
## NCLN    0.9074301  0.9423575  0.32067204  0.8486574  0.9870688  0.9869327
##                 IA          AW         TT         TB       TNRS        NRS
## Export  0.44299253  0.62008938  0.9703829 -0.8790801  0.7873466  0.7746907
## Import  0.41704572  0.66943448  0.9964019 -0.9859167  0.8432048  0.8427551
## GCI    -0.10351610  0.53166658  0.3786010 -0.3009665  0.2955338  0.2574477
## IS      0.20564953  0.79731310  0.8189549 -0.8060181  0.9377916  0.7914671
## APC     0.49789394  0.63694131  0.9425880 -0.9136071  0.9126194  0.8946952
## NCI     0.46933963  0.72269994  0.9522642 -0.9337969  0.8931389  0.9159824
## IA      1.00000000  0.02105853  0.4282192 -0.3861638  0.2757321  0.5338615
## AW      0.02105853  1.00000000  0.6634895 -0.6669847  0.6327353  0.6378146
## TT      0.42821917  0.66348946  1.0000000 -0.9681954  0.8373688  0.8337110
## TB     -0.38616379 -0.66698466 -0.9681954  1.0000000 -0.8368587 -0.8427488
## TNRS    0.27573209  0.63273533  0.8373688 -0.8368587  1.0000000  0.8303386
## NRS     0.53386153  0.63781462  0.8337110 -0.8427488  0.8303386  1.0000000
## NRSI    0.59758190  0.51350612  0.7263718 -0.7202141  0.7602879  0.9624468
## SHPC    0.54339374 -0.12751777  0.1290641 -0.1443733  0.2115460  0.1704800
## SHLN    0.57457595  0.16881941  0.4159888 -0.4208632  0.3501945  0.4648262
## NCLN    0.51627327  0.64165712  0.9430521 -0.9210248  0.8897700  0.9266619
##              NRSI       SHPC        SHLN       NCLN
## Export  0.6884985  0.1064967  0.38618759  0.9074301
## Import  0.7294981  0.1351762  0.42062501  0.9423575
## GCI     0.1803958 -0.3637836  0.02493716  0.3206720
## IS      0.7160722  0.1473343  0.26593705  0.8486574
## APC     0.8410490  0.2642021  0.47421334  0.9870688
## NCI     0.8427889  0.1826870  0.48437893  0.9869327
## IA      0.5975819  0.5433937  0.57457595  0.5162733
## AW      0.5135061 -0.1275178  0.16881941  0.6416571
## TT      0.7263718  0.1290641  0.41598877  0.9430521
## TB     -0.7202141 -0.1443733 -0.42086321 -0.9210248
## TNRS    0.7602879  0.2115460  0.35019453  0.8897700
## NRS     0.9624468  0.1704800  0.46482618  0.9266619
## NRSI    1.0000000  0.2197228  0.43891184  0.8672869
## SHPC    0.2197228  1.0000000  0.15434208  0.1926237
## SHLN    0.4389118  0.1543421  1.00000000  0.5454536
## NCLN    0.8672869  0.1926237  0.54545358  1.0000000
corrplot(corr, order = "FPC")

# Ensure the dataframe columns are numeric for the correlation calculation
numeric_data <- mydata[sapply(mydata, is.numeric)]

# Reorder the columns to move TT, TB, Export, and Import to the start
required_cols <- c("TT", "TB", "Export", "Import")
remaining_cols <- setdiff(names(numeric_data), required_cols)
mydata_reordered <- numeric_data[, c(required_cols, remaining_cols)]

# Calculate the correlation matrix
corr <- cor(mydata_reordered, use = "complete.obs")

# Create the correlation plot with hierarchical clustering order
corrplot(corr)

Apart from GCI, all variables significantly correlated with TT, but there is multicollinearity problem with these variables. To fix it let’s create single Digitalization index which can explain the dataset.

library(factoextra)
library(tidyverse)

pca_data <- mydata %>% select(GCI, IS, APC, NCI, AW, IA)

# PCA
pca_result <- prcomp(pca_data, scale = TRUE)
pca_result2 <- princomp(pca_data, cor = TRUE)


summary(pca_result2)
## Importance of components:
##                          Comp.1    Comp.2     Comp.3    Comp.4     Comp.5
## Standard deviation     1.954070 1.1326041 0.75789785 0.4891632 0.28306936
## Proportion of Variance 0.636398 0.2137987 0.09573486 0.0398801 0.01335471
## Cumulative Proportion  0.636398 0.8501967 0.94593151 0.9858116 0.99916632
##                              Comp.6
## Standard deviation     0.0707252098
## Proportion of Variance 0.0008336759
## Cumulative Proportion  1.0000000000
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.9541 1.1326 0.75790 0.48916 0.28307 0.07073
## Proportion of Variance 0.6364 0.2138 0.09573 0.03988 0.01335 0.00083
## Cumulative Proportion  0.6364 0.8502 0.94593 0.98581 0.99917 1.00000
fviz_eig(pca_result, addlabels = TRUE)

#fviz_cos2(pca_result, choice = "var", axes = 1:3)

The table of Importance of components and Scree plot indicate that PC1 can explain 63% of the dataset so this component is can be used as Digitalization index.

principal_components <- pca_result$x
selected_components <- principal_components[, 1]

pca_scores <- selected_components %*% t(principal_components[, 1])

# Combine components to create the PCA index
pca_index <- rowSums(pca_scores)

min_range <- 0
max_range <- 1
scaled_pca_index <- scales::rescale(pca_index, to = c(min_range, max_range))
nd <- mydata
nd$Digix <- scaled_pca_index

#tail(nd[c(-4, -3)], 20)
corm <- cor(nd[c(-1, -2)]); corm
##            Export     Import         GCI         IS        APC        NCI
## Export  1.0000000  0.9464173  0.43072055  0.7820418  0.9137079  0.9126492
## Import  0.9464173  1.0000000  0.35501911  0.8204491  0.9395345  0.9528416
## GCI     0.4307205  0.3550191  1.00000000  0.3961613  0.3286583  0.3782498
## IS      0.7820418  0.8204491  0.39616131  1.0000000  0.8789623  0.8878817
## APC     0.9137079  0.9395345  0.32865826  0.8789623  1.0000000  0.9862589
## NCI     0.9126492  0.9528416  0.37824976  0.8878817  0.9862589  1.0000000
## IA      0.4429925  0.4170457 -0.10351610  0.2056495  0.4978939  0.4693396
## AW      0.6200894  0.6694345  0.53166658  0.7973131  0.6369413  0.7226999
## TT      0.9703829  0.9964019  0.37860103  0.8189549  0.9425880  0.9522642
## TB     -0.8790801 -0.9859167 -0.30096647 -0.8060181 -0.9136071 -0.9337969
## TNRS    0.7873466  0.8432048  0.29553383  0.9377916  0.9126194  0.8931389
## NRS     0.7746907  0.8427551  0.25744768  0.7914671  0.8946952  0.9159824
## NRSI    0.6884985  0.7294981  0.18039578  0.7160722  0.8410490  0.8427889
## SHPC    0.1064967  0.1351762 -0.36378357  0.1473343  0.2642021  0.1826870
## SHLN    0.3861876  0.4206250  0.02493716  0.2659371  0.4742133  0.4843789
## NCLN    0.9074301  0.9423575  0.32067204  0.8486574  0.9870688  0.9869327
## Digix   0.8839045  0.9066404  0.51109787  0.9347378  0.9365947  0.9640342
##                 IA          AW         TT         TB       TNRS        NRS
## Export  0.44299253  0.62008938  0.9703829 -0.8790801  0.7873466  0.7746907
## Import  0.41704572  0.66943448  0.9964019 -0.9859167  0.8432048  0.8427551
## GCI    -0.10351610  0.53166658  0.3786010 -0.3009665  0.2955338  0.2574477
## IS      0.20564953  0.79731310  0.8189549 -0.8060181  0.9377916  0.7914671
## APC     0.49789394  0.63694131  0.9425880 -0.9136071  0.9126194  0.8946952
## NCI     0.46933963  0.72269994  0.9522642 -0.9337969  0.8931389  0.9159824
## IA      1.00000000  0.02105853  0.4282192 -0.3861638  0.2757321  0.5338615
## AW      0.02105853  1.00000000  0.6634895 -0.6669847  0.6327353  0.6378146
## TT      0.42821917  0.66348946  1.0000000 -0.9681954  0.8373688  0.8337110
## TB     -0.38616379 -0.66698466 -0.9681954  1.0000000 -0.8368587 -0.8427488
## TNRS    0.27573209  0.63273533  0.8373688 -0.8368587  1.0000000  0.8303386
## NRS     0.53386153  0.63781462  0.8337110 -0.8427488  0.8303386  1.0000000
## NRSI    0.59758190  0.51350612  0.7263718 -0.7202141  0.7602879  0.9624468
## SHPC    0.54339374 -0.12751777  0.1290641 -0.1443733  0.2115460  0.1704800
## SHLN    0.57457595  0.16881941  0.4159888 -0.4208632  0.3501945  0.4648262
## NCLN    0.51627327  0.64165712  0.9430521 -0.9210248  0.8897700  0.9266619
## Digix   0.37652597  0.83461014  0.9101608 -0.8804884  0.8888484  0.8708837
##              NRSI       SHPC        SHLN       NCLN      Digix
## Export  0.6884985  0.1064967  0.38618759  0.9074301  0.8839045
## Import  0.7294981  0.1351762  0.42062501  0.9423575  0.9066404
## GCI     0.1803958 -0.3637836  0.02493716  0.3206720  0.5110979
## IS      0.7160722  0.1473343  0.26593705  0.8486574  0.9347378
## APC     0.8410490  0.2642021  0.47421334  0.9870688  0.9365947
## NCI     0.8427889  0.1826870  0.48437893  0.9869327  0.9640342
## IA      0.5975819  0.5433937  0.57457595  0.5162733  0.3765260
## AW      0.5135061 -0.1275178  0.16881941  0.6416571  0.8346101
## TT      0.7263718  0.1290641  0.41598877  0.9430521  0.9101608
## TB     -0.7202141 -0.1443733 -0.42086321 -0.9210248 -0.8804884
## TNRS    0.7602879  0.2115460  0.35019453  0.8897700  0.8888484
## NRS     0.9624468  0.1704800  0.46482618  0.9266619  0.8708837
## NRSI    1.0000000  0.2197228  0.43891184  0.8672869  0.7840094
## SHPC    0.2197228  1.0000000  0.15434208  0.1926237  0.1155613
## SHLN    0.4389118  0.1543421  1.00000000  0.5454536  0.4035076
## NCLN    0.8672869  0.1926237  0.54545358  1.0000000  0.9308264
## Digix   0.7840094  0.1155613  0.40350758  0.9308264  1.0000000
corrplot(corm)

Based on the correlation matrix above, it is clear that the index we created has significant relation (91%) with Trade turnover of the region

library(ggplot2)
# Calculate mean Digix value for each region
region_digix <- aggregate(Digix ~ Region, data = nd, FUN = mean)

# Plotting the digitalization index over years for different regions
ggplot(data=nd, aes(x = Year, y = Digix, color = Region)) +
  geom_line() +
  labs(x = "Year", y = "Digitalization Index", title = "Digitalization Index over Time by Region") +
  theme_minimal()

ggplot(nd, aes(x = Digix, y = Export)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Scatter Plot of Digix and Export",
       x = "Digix",
       y = "Export") +
  theme_minimal()

boxplot(TT~Year, data = nd)

library(plm)
pdata <- pdata.frame(nd, index = c("Region", "Year"))
# Scatter plot for each region (TT)
ggplot(pdata, aes(x = Digix, y = TT, color = as.factor(Region))) +
  geom_point() +
  labs(title = "Scatter Plot of TT vs Digix by Region",
       x = "Digix",
       y = "TT")

# Scatter plot for each region (Export)
ggplot(pdata, aes(x = Digix, y = Export, color = as.factor(Region))) +
  geom_point() +
  labs(title = "Scatter Plot of Export vs Digix by Region",
       x = "Digix",
       y = "Export")

# Scatter plot for each region (Export)
ggplot(pdata, aes(x = Digix, y = Import, color = as.factor(Region))) +
  geom_point() +
  labs(title = "Scatter Plot of Import vs Digix by Region",
       x = "Digix",
       y = "Import")

library(gplots)
plotmeans(TT~Region , main = 'Heterogeneity Across Regions', data = pdata)

plotmeans(TT~Year , main = 'Heterogeneity Across Years', data = pdata)

These graphs suggest that heterogeneity differs between distinct units.

Pooled Model

\(H_0\): Pooled OLS is not the appropriate estimator

\(H_1\): Pooled OLS is the appropriate estimator

pooledOls <- plm(TT ~ Digix, data = pdata , model = "pooling")
summary(pooledOls)
## Pooling Model
## 
## Call:
## plm(formula = TT ~ Digix, data = pdata, model = "pooling")
## 
## Balanced Panel: n = 14, T = 8, N = 112
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -3712.148  -559.065    68.143   721.669  3995.789 
## 
## Coefficients:
##             Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)  -914.80     177.35 -5.1581 1.113e-06 ***
## Digix       18653.68     809.50 23.0434 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    1134100000
## Residual Sum of Squares: 194620000
## R-Squared:      0.82839
## Adj. R-Squared: 0.82683
## F-statistic: 530.998 on 1 and 110 DF, p-value: < 2.22e-16

\[TT_i = \beta_0 + \beta_1 \cdot Digix_{it} + u_{it}\] \[ TT = −914.80 + 18653.68 \cdot Digix_{it} + u_{it}\]

Summary:

  • P-Value: - For the overall model p-value less than 5%. It indicates that there is strong evidence against the null hypothesis. So, we can reject the null hypothesis and accept the alternative hypothesis.
  • Intercepts are also significant as p values are less than 5%.
  • R-Squared: 83% of data can be understood with this model

Pooled OLS ignores heterogeneity totally, therefore even if it has a P-Value of less than 5% and R2, showing high overall accuracy and suitability to use, we cannot rely on this model.

# Scatter plot for each region
ggplot(pdata, aes(x = Digix, y = TT, color = as.factor(Region))) +
  geom_point() +
  labs(title = "Scatter Plot of TT vs Digix by Region",
       x = "Digix",
       y = "TT")

Least Squares Dummy Variable Model (LSDV)

fixed.dum <-lm(TT ~ Digix + factor(Region)-1, data = pdata)
summary(fixed.dum)
## 
## Call:
## lm(formula = TT ~ Digix + factor(Region) - 1, data = pdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3386.9  -252.8    64.6   373.3  3681.4 
## 
## Coefficients:
##                                              Estimate Std. Error t value
## Digix                                        10460.03    1100.96   9.501
## factor(Region)Andijan region                  1215.25     338.30   3.592
## factor(Region)Bukhara region                  -265.52     341.05  -0.779
## factor(Region)Fergana region                  -231.08     357.73  -0.646
## factor(Region)Jizzakh region                  -429.10     327.24  -1.311
## factor(Region)Kashkadarya region              -630.22     342.79  -1.838
## factor(Region)Khorezm region                  -755.59     337.52  -2.239
## factor(Region)Namangan region                 -328.42     337.79  -0.972
## factor(Region)Navoi region                     -26.23     339.15  -0.077
## factor(Region)Samarkand region                -259.60     365.86  -0.710
## factor(Region)Surkhandarya region             -529.35     329.62  -1.606
## factor(Region)Syrdarya region                 -157.13     321.28  -0.489
## factor(Region)Tashkent city                   5467.59     761.54   7.180
## factor(Region)Tashkent region                 2390.05     366.45   6.522
## factor(Region)The Republic of Karakalpakstan  -536.92     336.32  -1.596
##                                              Pr(>|t|)    
## Digix                                        1.61e-15 ***
## factor(Region)Andijan region                 0.000517 ***
## factor(Region)Bukhara region                 0.438148    
## factor(Region)Fergana region                 0.519839    
## factor(Region)Jizzakh region                 0.192865    
## factor(Region)Kashkadarya region             0.069053 .  
## factor(Region)Khorezm region                 0.027462 *  
## factor(Region)Namangan region                0.333341    
## factor(Region)Navoi region                   0.938506    
## factor(Region)Samarkand region               0.479685    
## factor(Region)Surkhandarya region            0.111540    
## factor(Region)Syrdarya region                0.625882    
## factor(Region)Tashkent city                  1.42e-10 ***
## factor(Region)Tashkent region                3.14e-09 ***
## factor(Region)The Republic of Karakalpakstan 0.113645    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 889.2 on 97 degrees of freedom
## Multiple R-squared:  0.9511, Adjusted R-squared:  0.9435 
## F-statistic: 125.7 on 15 and 97 DF,  p-value: < 2.2e-16

Here each region is regarded as a dummy variable for estimation.

  • Overall model is good as the p-value is less than 5%.

  • Different regions contribute differently to Total Foreign Trade Turnover, as reflected by the fixed effects. Andijan, Khorezm, Tashkent City, and Tashkent Region, compared to others, have significant impact on Foreign Trade Turnover.

  • Multiple R-squared is 95% which is also more accurate than pooled OLS.

library(car)
yhat <- fixed.dum$fitted
Region <- pdata$Region
scatterplot(yhat ~ pdata$Digix| Region, xlab= "Regions", ylab ="yhat", smoother = quantregLine, legend = list( cex = 0.56, lty = 2))
abline(lm(TT~Digix, data = pdata),lwd=3, col="red")

This graph includes regression lines for 14 regression models of each region. The redline is the total regression line of LSVD model.

The complexity of LSDV increases with the number of parameters, including dummy variables, leading to less efficient estimators. That’s why we need to continue with next model.

Fixed Effects using Within Model

fixed <- plm(TT ~ Digix, data =pdata, index=c("Region", "Year"), model ="within")
summary(fixed)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = TT ~ Digix, data = pdata, model = "within", index = c("Region", 
##     "Year"))
## 
## Balanced Panel: n = 14, T = 8, N = 112
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -3386.874  -252.766    64.561   373.296  3681.379 
## 
## Coefficients:
##       Estimate Std. Error t-value  Pr(>|t|)    
## Digix    10460       1101  9.5008 1.605e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    148080000
## Residual Sum of Squares: 76701000
## R-Squared:      0.48202
## Adj. R-Squared: 0.40726
## F-statistic: 90.2651 on 1 and 97 DF, p-value: 1.6054e-15
  • P-Value: Overall model is good as the p-value less than 5%

  • The intercept values are significant as the p-value is less than 5%.

  • R2: Value of R2 is 0.48202, which is less than pooled OLS and LSVD, but more accurate than them.

fixef(fixed)
##                 Andijan region                 Bukhara region 
##                       1215.251                       -265.523 
##                 Fergana region                 Jizzakh region 
##                       -231.078                       -429.099 
##             Kashkadarya region                 Khorezm region 
##                       -630.215                       -755.593 
##                Namangan region                   Navoi region 
##                       -328.420                        -26.232 
##               Samarkand region            Surkhandarya region 
##                       -259.598                       -529.353 
##                Syrdarya region                  Tashkent city 
##                       -157.134                       5467.588 
##                Tashkent region The Republic of Karakalpakstan 
##                       2390.049                       -536.920

Here intercept values of each region is provided. They are similar to the values we got from LSVD model.

Random Effect Model Regression Analysis

random <- plm(TT ~ Digix, data=pdata, index=c("Region", "Year"), model="random")
summary(random)
## Oneway (individual) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = TT ~ Digix, data = pdata, model = "random", index = c("Region", 
##     "Year"))
## 
## Balanced Panel: n = 14, T = 8, N = 112
## 
## Effects:
##                    var  std.dev share
## idiosyncratic 790727.6    889.2   0.6
## individual    527899.0    726.6   0.4
## theta: 0.6029
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -2619.417  -409.508    64.639   478.480  4564.810 
## 
## Coefficients:
##             Estimate Std. Error z-value Pr(>|z|)    
## (Intercept)  -201.25     286.61 -0.7022   0.4826    
## Digix       14037.34    1019.04 13.7751   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    303580000
## Residual Sum of Squares: 111400000
## R-Squared:      0.63303
## Adj. R-Squared: 0.6297
## Chisq: 189.754 on 1 DF, p-value: < 2.22e-16
  • P-Value: Overall model is good as the p-value less than 5%

  • The Digix value is significant as the p-value is less than 5%, but intercept value has low significance

  • The value of R2 is 0.91129 and 0.63303, which is better than Fixed effects using within model

Tests

Breusch-Pagan (BP) tes (Graphical)

library(lmtest)
bptest(TT ~ Digix+factor(Region)+factor(Year), data=pdata, studentize=F)
## 
##  Breusch-Pagan test
## 
## data:  TT ~ Digix + factor(Region) + factor(Year)
## BP = 268.71, df = 21, p-value < 2.2e-16

We may reject the null hypothesis since the BP test’s P-value of less than 5% shows that the residual’s variance is changing as the predictor value rises. As a result, heteroscedasticity exists in the data.

Pooled OLS vs Fixed Effect Model

H0: Pooled OLS model is consistent

H1: Fixed effect model is consistent

pFtest(fixed,pooledOls)
## 
##  F test for individual effects
## 
## data:  TT ~ Digix
## F = 11.471, df1 = 13, df2 = 97, p-value = 1.545e-14
## alternative hypothesis: significant effects

The result of pFtest () reveals that the p-value is less than 5%, allowing us to reject the null hypothesis. Therefore, the fixed effect model is consistent.

Hausman Test

H0: Random Effect is consistent.

H1: Fixed Effect is consistent.

phtest(fixed, random)
## 
##  Hausman Test
## 
## data:  TT ~ Digix
## chisq = 73.678, df = 1, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent

Here, the P-value is less than 5%. It suggests that there is significant evidence against the null hypothesis, with less than a 5% chance that it is right. As a result, the fixed effect model is an appropriate estimator.

Two tests above proved that fixed effects model is best estimator, so let’s check for time-fixed effects

fixed.time <- plm(TT ~ Digix + factor(Year), data=pdata, index=c("Region",
"Year"), model="within")
#summary(fixed.time)
pFtest(fixed.time, fixed)
## 
##  F test for individual effects
## 
## data:  TT ~ Digix + factor(Year)
## F = 4.5119, df1 = 7, df2 = 90, p-value = 0.0002392
## alternative hypothesis: significant effects
plmtest(fixed, c("time"), type=("bp"))
## 
##  Lagrange Multiplier Test - time effects (Breusch-Pagan)
## 
## data:  TT ~ Digix
## chisq = 54.798, df = 1, p-value = 1.336e-13
## alternative hypothesis: significant effects

Both tests show lower p than alpha, which means we should add time-fixed effects to the model

Final Model

Based one results of tests above we construct a regression model with Time Fixed Effects here:

\[Y_{it} = \beta_0 + \beta_1 X_{it} + \delta_2 Dummy2 + \delta_T DummyT_t + u_{it}\] This model removes omitted variable bias by removing unobserved variables that change over time but remain constant between entities.

fm <- plm(TT ~ Digix + factor(Year) - 1, data =pdata, index=c("Region", "Year"), model ="within", effect = "twoways")
summary(fm)
## Twoways effects Within Model
## 
## Call:
## plm(formula = TT ~ Digix + factor(Year) - 1, data = pdata, effect = "twoways", 
##     model = "within", index = c("Region", "Year"))
## 
## Balanced Panel: n = 14, T = 8, N = 112
## 
## Residuals:
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -3256.28  -323.53    34.96   355.82  3167.62 
## 
## Coefficients:
##       Estimate Std. Error t-value  Pr(>|t|)    
## Digix  19959.0     2122.2  9.4047 4.946e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    112570000
## Residual Sum of Squares: 56776000
## R-Squared:      0.49565
## Adj. R-Squared: 0.37797
## F-statistic: 88.4481 on 1 and 90 DF, p-value: 4.9456e-15
#tlm <- lm(TT ~ Digix + Region + factor(Year) - 1, data =pdata)
#summary(tlm)

The final model is: \[TT_{it} = \beta_1 \cdot Digix_{it} + \alpha_i + \gamma_i + u_{it}\] * \(TT_{it}\) - it is the dependent variable for region i at time t.
* \(\beta_1\) - the coefficient for the variable \(Digix_{it}\)
* \(\alpha_i\) - represents the fixed effect for region i.
* \(\gamma_i\) - represents the fixed effect for time t.
* \(u_{it}\) - the error term.

\[TT_{it} = 19959.0 \cdot Digix_{it} + \alpha_i + \gamma_i + u_{it}\] The estimated fixed effects for each region are:

fixef(fm)
##                 Andijan region                 Bukhara region 
##                         973.75                        -569.83 
##                 Fergana region                 Jizzakh region 
##                        -867.43                        -376.28 
##             Kashkadarya region                 Khorezm region 
##                        -972.70                        -978.66 
##                Namangan region                   Navoi region 
##                        -558.02                        -287.40 
##               Samarkand region            Surkhandarya region 
##                       -1037.76                        -547.66 
##                Syrdarya region                  Tashkent city 
##                         108.26                         319.41 
##                Tashkent region The Republic of Karakalpakstan 
##                        1601.92                        -731.40

Summary - The panel data analysis with two-way fixed effects provides useful insights into the economic variables that has influence on the dependent variable (TT) across different locations and time.

The equation (\(TT_{it} = 19959.0 \times Digix_{it} + \alpha_i + \gamma_t + u_{it}\)) shows a strong positive correlation (91%) between the dependent variable (TT) and the digitalization. The estimated coefficient for (Digix) is 19959.0, which is statistically significant (p < 0.001).

This coefficient shows how one unit rise in the Digitalization Index (Digix) is expected to impact on region’s Foreign Trade Turnover (TT). In other terms, Foreign Trade Turnover of each region will rise by 19,959 million US dollars for every unit improvement in digitization. This indicates that the economic operations associated with international trade within the region benefits greatly from digitalization. Therefore, regions with higher digitalization are likely to witness a considerable boost in their trade activities.

The fixed effects provide these additional details:

  • The Tashkent region has a relatively higher fixed effect (1601.92), hence it has a positive impact on the dependent variable beyond the digitalization, and the reason of this might be because Tashkent region includes the capital of Uzbekistan and it is more developed than other regions.

  • Some regions, such as Fergana, Kashkadarya, and Samarkand, have negative fixed effects. They may experience lower (TT) in the overall trend of (Digix).

The (\(R^2\)) value of 0.49565 suggests that the model explains approximately 49.6% of the data.

In conclusion, the results indicate the importance of digitaliation (Digix) in foreign trade turnover of regions. This model can be considered while developing strategies to enhance the economic development and well-being of these areas. However, further research may be needed to address the specific challenges and opportunities of each region.

Future of the study

Although this study identifies a regional variation in the digital economy that affect foreign trade turnover of regions, there are significant limitations. Firstly, there is a need for more research because the current studies have not established a common framework to measure the digital economy and there might be discrepancies in the model. Secondly, the assessment of model was established exclusively using the data, which is limited to Uzbekistan’s regions, and may be less generalizable to other countries. This study’s empirical examination of the digital economy’s impact on inter-regional commerce is conducted at the macro level, and how it might be adapted to the more particular city level is an area for future research.

Cluster analysis

library (cluster) # кластерный анализ
library(GGally)
library(plotly)
# Remove non-numeric columns for clustering
maind <- mydata %>% filter(Year == 2022)
cd <- maind %>% select(GCI, IS, APC, NCI, IA, AW)
scaled_dt1 <- scale(cd)
fviz_nbclust(scaled_dt1, kmeans, method = "wss")

Perform K-means clustering:

library(ggrepel)
library(tidyr)
library(geojsonio)
library(tmaptools)
library(tmap)
library(sf)
library(leaflet)


# Data frame with number of clusters per year
yed <- data.frame(
  Year = 2015:2022,
  KC = c(3, 3, 3, 3, 3, 4, 3, 3)
)

# Loop through unique years
unique_years <- unique(mydata$Year)

for (year in unique_years) {
  # Subset data for the current year
  md <- mydata %>% filter(Year == year) %>% select(Year, GCI, IS, APC, NCI, IA, AW, TNRS, NRS, NRSI, SHPC, SHLN, NCLN)
  year_data <- mydata %>% filter(Year == year) %>% select(GCI, IS, APC, NCI, IA, AW, TNRS, NRS, NRSI, SHPC, SHLN, NCLN)
  
  # Check if data exists for the current year
  if (nrow(year_data) > 0) {
    # Scale the data
    scaled_dt <- scale(year_data)
    
    # Determine the number of clusters for the current year
    #ok <- fviz_nbclust(scaled_dt, kmeans, method = "wss")
    #print(ok)
    kc <- as.numeric(yed %>% filter(Year == year) %>% select(KC))
    cat("Number of clusters for year", year, ":", kc, "\n")
    
    # Perform k-means clustering
    set.seed(1) # Set seed for reproducibility
    kmeans_result <- kmeans(scaled_dt, centers = kc)
    
    # Add cluster assignments back to the original data
    mydata_clustered <- mydata %>% filter(Year == year) %>%
      mutate(Cluster = kmeans_result$cluster)
    
    # Extract the first two principal components for plotting
    pca_result <- prcomp(scaled_dt)
    pc_data <- data.frame(pca_result$x[, 1:2])
    pc_data <- cbind(pc_data, mydata_clustered)
    
    # Plotting the clusters
    plot_title <- paste("Cluster Analysis Result for Year", year)
    p <- ggplot(pc_data, aes(x = PC1, y = PC2, color = as.factor(Cluster))) +
      geom_point() +
      geom_text_repel(aes(label = Region), size = 3) +
      labs(title = plot_title, color = "Cluster") +
      theme_minimal()
    
    print(p)
    
        final_data <- cbind(md, Cluster = as.factor(kmeans_result$cluster))
    #print(final_data)
        # Parallel coordinate plot
    p2 <- ggparcoord(data = final_data, columns = c(2:13), groupColumn = "Cluster", scale = "center") + labs (x = "Variables", У = "Values", title = "Cluster Analysis based on k means")
    
    # Make the plot interactive
    ggplotly(p2)
    print(p2)
    
    a <- aggregate(year_data, by=list(cluster=as.factor(kmeans_result$cluster)), mean)
    print(a)
   
  } else {
    cat("No data available for year", year, "\n")
  }
}
## Number of clusters for year 2015 : 3

##   cluster      GCI        IS       APC        NCI       IA       AW     TNRS
## 1       1 120.8667  416.7000  35410.83   8800.000 21.73333 1063.433 1148.083
## 2       2 112.1000 2465.3000 245874.00 113251.000 37.70000 2256.000 4534.400
## 3       3 117.0143  481.9286  39461.43   8265.143 16.08571 1077.829 1320.457
##         NRS  NRSI     SHPC     SHLN      NCLN
## 1  61.85000  60.3 45.23333 5.116667  12205.17
## 2 190.30000 182.2 55.40000 7.000000 129588.00
## 3  55.31429  54.2 38.18571 3.385714  12077.57
## Number of clusters for year 2016 : 3

##   cluster    GCI        IS      APC       NCI       IA       AW    TNRS    NRS
## 1       1 109.26  510.8100  39783.4   9996.00 20.50000 1167.360 1254.49  55.85
## 2       2 108.90  553.4333  45624.0  13247.67 28.83333 1119.167 1329.40  67.60
## 3       3 121.70 2858.3000 266061.0 131654.00 42.80000 2650.600 4732.30 196.50
##        NRSI  SHPC     SHLN      NCLN
## 1  54.93000 43.81 4.630000  13817.10
## 2  65.53333 60.10 5.966667  15272.67
## 3 190.10000 57.40 7.300000 141477.00
## Number of clusters for year 2017 : 3

##   cluster     GCI       IS       APC      NCI      IA       AW     TNRS    NRS
## 1       1 127.500 3169.700 280899.00 144896.0 44.5000 3181.700 4990.000 204.10
## 2       2 119.240  461.900  37614.40  10817.4 20.5200 1520.300  994.720  53.86
## 3       3 115.225  711.125  48106.75  13934.5 24.4125 1405.638 1567.612  65.75
##       NRSI    SHPC   SHLN      NCLN
## 1 197.2000 60.3000 7.1000 151742.00
## 2  52.9600 37.8800 5.4400  15579.60
## 3  64.2375 52.1375 4.7375  16842.25
## Number of clusters for year 2018 : 3

##   cluster      GCI        IS       APC       NCI      IA      AW     TNRS
## 1       1 107.9344  865.6375  53024.88  16498.12 26.3375 1825.05 1526.625
## 2       2 110.0588  607.3400  41670.80  13446.00 19.9600 1839.72 1028.680
## 3       3 123.5120 3359.9000 297347.00 158788.00 43.2000 4212.70 4598.100
##        NRS   NRSI   SHPC   SHLN      NCLN
## 1  63.9375  62.25 60.975 5.2625  19015.62
## 2  52.1600  51.16 40.320 5.1600  16662.40
## 3 184.9000 178.20 60.300 8.8000 166057.00
## Number of clusters for year 2019 : 3

##   cluster      GCI       IS       APC       NCI       IA       AW     TNRS
## 1       1 110.3500  744.200  42953.00  15305.50 19.40000 2112.575 1020.175
## 2       2 111.2889 1095.422  57858.33  19603.89 24.26667 2171.344 1699.411
## 3       3 105.8000 3550.600 320189.00 175760.00 39.50000 4953.000 4471.300
##         NRS      NRSI     SHPC     SHLN      NCLN
## 1  57.25000  55.37500 38.80000 4.625000  16568.00
## 2  67.76667  65.74444 60.03333 4.722222  20276.22
## 3 176.00000 166.80000 70.00000 7.900000 168112.00
## Number of clusters for year 2020 : 4

##   cluster      GCI       IS       APC        NCI       IA       AW     TNRS
## 1       1 116.4000  774.800  23537.67   9445.333 14.23333 2672.033 1015.767
## 2       2 143.1000 1255.200  81215.00  36050.000 19.40000 2181.800 1405.100
## 3       3 132.0000 4620.900 319517.00 175926.000 30.70000 5512.000 5126.000
## 4       4 111.1889 1308.944  60371.22  22400.111 20.40000 2455.278 1821.433
##         NRS      NRSI     SHPC     SHLN      NCLN
## 1  64.93333  63.20000 34.90000 2.466667   7480.00
## 2  47.60000  45.00000 45.20000 3.400000  27305.00
## 3 194.00000 155.50000 58.60000 6.800000 163233.00
## 4  71.04444  69.06667 64.47778 3.133333  18173.33
## Number of clusters for year 2021 : 3

##   cluster      GCI       IS       APC      NCI       IA       AW     TNRS
## 1       1 127.1667 1200.431  52321.33  26488.0 11.67503 3552.265 1412.967
## 2       2 132.5000 4275.405 352751.00 207782.0 25.74264 6851.202 6323.400
## 3       3 115.6700 1511.046  59642.80  25168.7 16.79413 3014.503 1846.010
##      NRS   NRSI  SHPC  SHLN      NCLN
## 1  61.30  57.40 30.60 2.500  17931.33
## 2 224.10 154.90 48.20 5.700 184062.00
## 3  75.49  70.11 49.94 2.303  18370.40
## Number of clusters for year 2022 : 3

##   cluster      GCI       IS       APC       NCI       IA       AW     TNRS
## 1       1 127.7102 1727.343  57252.71  29890.29 12.84286 4133.171 1944.571
## 2       2 125.1214 1541.209  67650.83  37061.33 21.36667 4175.209 1779.667
## 3       3 130.1430 5384.981 381051.00 236242.00 28.60000 9148.741 7691.900
##         NRS      NRSI     SHPC      SHLN      NCLN
## 1  74.38571  68.21429 25.34286  3.757143  24126.71
## 2  77.05000  71.76667 40.86667  5.733333  27783.00
## 3 264.40000 157.60000 43.00000 11.300000 233004.00
# Load required packages
library(sf)
library(leaflet)

# Read GeoJSON file
uzbekistan_geojson <- "./uzbekistan_regional.geojson"
uzbekistan <- sf::st_read(uzbekistan_geojson)
## Reading layer `uzbekistan_regional' from data source 
##   `/Users/javlon/University/Big Data/FW/Independent work/uzbekistan_regional.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 14 features and 3 fields
## Geometry type: GEOMETRYCOLLECTION
## Dimension:     XY
## Bounding box:  xmin: 55.9989 ymin: 37.18212 xmax: 73.14812 ymax: 45.59051
## Geodetic CRS:  WGS 84
# Replace names
uzbekistan$ADM1_EN[uzbekistan$ADM1_EN == "Kashkadarya province"] <- "Kashkadarya region"
uzbekistan$ADM1_EN[uzbekistan$ADM1_EN == "Republic of Karakalpakstan"] <- "The Republic of Karakalpakstan"

str(uzbekistan)
## Classes 'sf' and 'data.frame':   14 obs. of  4 variables:
##  $ ADM1_EN : chr  "Tashkent city" "Namangan region" "Tashkent region" "Fergana region" ...
##  $ ADM1_RU : chr  "г. Ташкент" "Наманганская область" "Ташкентская область" "Ферганская область" ...
##  $ ADM1_UZ : chr  "Toshkent sh." "Namangan viloyati" "Toshkent viloyati" "Fargʻona viloyati" ...
##  $ geometry:sfc_GEOMETRYCOLLECTION of length 14; first list element: List of 1
##   ..$ :List of 2
##   .. ..$ :List of 1
##   .. .. ..$ : num [1:935, 1:2] 69.1 69.1 69.1 69.1 69.1 ...
##   .. ..$ :List of 1
##   .. .. ..$ : num [1:88, 1:2] 69.2 69.2 69.2 69.2 69.2 ...
##   .. ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##   ..- attr(*, "class")= chr [1:3] "XY" "GEOMETRYCOLLECTION" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA
##   ..- attr(*, "names")= chr [1:3] "ADM1_EN" "ADM1_RU" "ADM1_UZ"
uzbekistan$ADM1_EN 
##  [1] "Tashkent city"                  "Namangan region"               
##  [3] "Tashkent region"                "Fergana region"                
##  [5] "Andijan region"                 "Syrdarya region"               
##  [7] "Jizzakh region"                 "Navoi region"                  
##  [9] "Samarkand region"               "Kashkadarya region"            
## [11] "Surkhandarya region"            "Bukhara region"                
## [13] "Khorezm region"                 "The Republic of Karakalpakstan"
# Subset data for the selected year
year_data_subset <- mydata %>% filter(Year == 2022) %>% select(Region)

year_data_a <- mydata %>% filter(Year == 2022) %>% select(GCI, IS, APC, NCI, IA, AW, TNRS, NRS, NRSI, SHPC, SHLN, NCLN)

# Scale the data
scaled_data <- scale(year_data_a)  # Exclude Year column for scaling

# Perform k-means clustering
set.seed(1) # Set seed for reproducibility
kmeans_result <- kmeans(scaled_data, centers = 3)

# Add cluster assignments back to the original data
year_data_clustered <- year_data_subset %>%
  mutate(Cluster = kmeans_result$cluster)

# Merge Digix values with the GeoJSON data
uzbekistan <- merge(uzbekistan, year_data_clustered, by.x = "ADM1_EN", by.y = "Region", all.x = TRUE)

# Convert geometry to MULTIPOLYGON
uzbekistan <- st_cast(uzbekistan, "MULTIPOLYGON")

# Define colors for each cluster
cluster_colors <- c("#e9493e", "#00bb38", "#629dff")

# Plot the map
leaflet(data = uzbekistan) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addPolygons(fillColor = ~cluster_colors[Cluster],
              fillOpacity = 0.8,
              stroke = FALSE,
              label = ~paste(ADM1_EN, ": Cluster", Cluster)) %>%
  addLegend("bottomright", 
            colors = cluster_colors,
            labels = c("Cluster 1", "Cluster 2", "Cluster 3"),
            title = "Clusters",
            opacity = 1)

Cluster 1

Cluster 1 comprises regions characterized by moderate development in the context of digitalization, particularly within the realm of communication and information services. The growth rate for communication and information services (GCI) in this cluster stands at 127.80%, indicating a steady but not exceptional expansion in this sector. The number of subscribers with Internet access (IS) in these regions is relatively low, at 1677.471 thousand, suggesting limited penetration of Internet services among the population. This limitation could be attributed to various factors, including inadequate infrastructure, lower levels of digital literacy, or economic constraints preventing wider access to Internet services.

Furthermore, the availability of personal computers (APC) in enterprises and organizations is moderate, with 56815.88 units reported. This level of availability indicates that while there are computers present in the business sector, their distribution is not extensive, potentially impacting productivity and digital engagement within these organizations. The number of computers connected to the Internet (NCI) is 29377.5 units, reflecting a similar trend of moderate connectivity. The share of enterprises and organizations with Internet access (IA) is relatively low at 13.46%, pointing to significant room for improvement in integrating digital tools and online connectivity in business operations. The average monthly nominal wages in the information and communication sector (AW) are 4121.696 thousand soums, which is on the lower end, suggesting that the sector might not be as financially rewarding in these regions, potentially influencing the attractiveness of careers in this field.

Cluster 2

Regions in Cluster 2 exhibit slightly more advanced digitalization metrics compared to Cluster 1, but they still fall short of the highest standards. The growth rate for communication and information services (GCI) in this cluster is 124.47%, slightly lower than in Cluster 1, yet still indicative of a growing sector. The number of subscribers with Internet access (IS) is 1583.778 thousand, which is comparable to Cluster 1, reinforcing the notion that Internet penetration remains an area needing further development. This similarity in Internet subscription numbers suggests that these regions might face similar barriers to digital adoption as those in Cluster 1, such as infrastructural limitations or socio-economic factors.

The availability of personal computers (APC) in enterprises and organizations is higher in Cluster 2, with 70429.40 units, indicating better provision of digital tools within the business sector. This increased availability likely supports enhanced operational efficiency and digital engagement among businesses. Correspondingly, the number of computers connected to the Internet (NCI) is higher at 39316.0 units, reflecting improved connectivity and integration of digital resources. The share of enterprises and organizations with Internet access (IA) in Cluster 2 is 22.08%, demonstrating a notable improvement over Cluster 1. This higher percentage suggests that more businesses are leveraging online tools and platforms to enhance their operations and market reach. The average monthly nominal wages in the information and communication sector (AW) are slightly higher at 4201.977 thousand soums, indicating a marginally more lucrative sector, which could potentially attract more talent and investment into this field.

Cluster 3

Cluster 3 represents the most advanced regions in terms of digitalization and development of communication and information services. The growth rate for communication and information services (GCI) in this cluster is the highest at 130.14%, highlighting robust expansion and significant investments in this sector. The number of subscribers with Internet access (IS) is markedly higher at 5384.981 thousand, indicating extensive penetration and widespread access to Internet services among the population. This high level of Internet subscription suggests well-developed infrastructure, higher digital literacy, and greater economic capability to support widespread Internet use.

The availability of personal computers (APC) in enterprises and organizations is exceptionally high at 381051.0 units, reflecting a substantial provision of digital tools within the business sector. This extensive availability supports advanced operational capabilities and fosters a high level of digital engagement and productivity. Similarly, the number of computers connected to the Internet (NCI) is the highest among all clusters at 236242.0 units, indicating comprehensive connectivity and integration of digital resources. The share of enterprises and organizations with Internet access (IA) is also the highest at 28.60%, demonstrating widespread adoption of online tools and platforms, which likely enhances business operations and competitiveness. The average monthly nominal wages in the information and communication sector (AW) are significantly higher at 9148.741 thousand soums, making this sector highly lucrative and likely attracting top talent and substantial investment. This high wage level suggests a vibrant and dynamic sector that is central to the economic activities in these regions, further driving digital advancement and economic growth.

References


  1. Li, M.; Zhang, L.; Zhang, Z. Impact of Digital Economy on Inter-Regional Trade: An Empirical Analysis in China. Sustainability 2023, 15, 12086. https://doi.org/10.3390/su151512086↩︎

  2. Abeliansky, A.L.; Hilbert, M. Digital technology and international trade: Is it the quantity of subscriptions or the quality of data speed that matters? Telecommun. Policy 2017, 41, 35–48.↩︎

  3. He, S.; Zhao, J.; Zhang, R. Development level of digital economy, trade costs and trade in value-added. Int. Econ. Trade Res. 2021, 11, 4–19.↩︎

  4. Baltagi, B. H. (2005). Econometric analysis of panel data (Vol. 3). John Wiley & Sons.↩︎

  5. Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data (2nd ed.). MIT press. [2]↩︎

  6. Arellano, M., & Bond, S. (1991). Some tests of specification for panel data: Monte Carlo evidence. Review of Economic Studies, 58(2), 277-297. [3]↩︎